date()
## [1] "Wed Nov 25 14:37:51 2020"
Never thought that I would actually end up doing stuff in R. Up until now, I have done most of my work in Python. I feel excited, but at the same time a bit worried that this will be a lot of work. Everything will probably be alright, and I will have gained a lot of useful knowledge by the end of this course.
I found this course while desperately looking for the final study credits that I need to finish my doctoral studies. I have struggled to find something that actually seems useful for my future. This just might be it.
My GitHub repo: https://github.com/CurlyNikolai/IODS-project
date()
## [1] "Wed Nov 25 14:37:51 2020"
The data set we are going to use for this exercise is found in the data folder named learning2014.txt.
So let’s read the file and move on to presenting it.
lrn14 <- read.csv("data/learning2014.txt")
The dataset presents results from a questionnaire where participants answered a set of questions regarding deep, surface and strategic learning. The dataset also holds information on the participants' gender, age, attitude and exam points.
The structure of the data is as follows:
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
And the dimensions:
dim(lrn14)
## [1] 166 7
The following graph gives a nice overview of all of the data, colour-coded by gender. We can see all of the variables plotted against each other as scatter plots, each variable's distribution, and the correlations between variables.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
p <- ggpairs(lrn14, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
The summary of all variables in the data can be viewed below.
summary(lrn14)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
For this part we will be fitting a regression model using the exam points as our target variable, and gender, age and attitude as explanatory variables (mainly because I did not want to repeat what was done in datacamp). Below the model is created, and its summary printed out.
model <- lm(points ~ gender + age + attitude, data = lrn14)
summary(model)
##
## Call:
## lm(formula = points ~ gender + age + attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## genderM -0.33054 0.91934 -0.360 0.720
## age -0.07586 0.05367 -1.414 0.159
## attitude 3.60657 0.59322 6.080 8.34e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
As we can see above in the summary, gender and age are not significant in modelling the exam points (which is a good thing!) when compared to attitude. So let's remove gender and age from the model, leaving a simple linear regression with attitude as the only explanatory variable:
model <- lm(points ~ attitude, data = lrn14)
We can also easily plot this:
qplot(attitude, points, data = lrn14) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
### 2.4 Relationship between exam points and attitude
The model can be summarized as follows:
summary(model)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
As stated earlier, there is a relationship between attitude and exam points, and the low p-value indicates that it is statistically significant. The residuals are quite large, meaning that the data is quite spread out, which can also be seen from the model plot earlier. Due to the large spread, the multiple R-squared value is also quite low. But this does not necessarily mean that it is a bad model!
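As a sanity check on the R-squared reading, the value can be recomputed from the residuals. A minimal self-contained sketch on synthetic data (the same check would apply to `model` and `lrn14$points`):

```r
# Multiple R-squared equals 1 - RSS/TSS: the share of variance the model explains.
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)          # synthetic stand-in for the attitude/points data
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)    # residual sum of squares
tss <- sum((y - mean(y))^2)     # total sum of squares
all.equal(1 - rss / tss, summary(fit)$r.squared)  # TRUE
```

This is why a large residual spread directly pushes R-squared down.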
### 2.5 Model validation
The residuals vs fitted values, normal qq-plot and residuals vs leverage plots can be viewed below:
par(mfrow=c(2,2))
plot(model, which=c(1,2,5))
The residuals vs leverage plot is on such a small scale on the x-axis that everything seems fine. The same can be said of the residuals vs fitted plot, where the data seems quite evenly clumped together. However, from the QQ-plot we can see that the data deviates somewhat from the line at both tails. This means that the normality assumption is questionable.
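The visual QQ-plot impression could be complemented with a formal normality test. A hedged sketch using the Shapiro-Wilk test — on synthetic data here so the block is self-contained, but in this analysis the input would be `residuals(model)`:

```r
# Shapiro-Wilk normality test: a small p-value suggests departure from normality.
# Synthetic stand-in for residuals(model); 166 values, matching the number of
# observations in lrn14.
set.seed(2020)
res <- rnorm(166)
shapiro.test(res)$p.value  # a value > 0.05 would support the normality assumption
```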
date()
## [1] "Wed Nov 25 14:37:57 2020"
The data is read in and the variable names printed in the code block below.
alc <- read.csv("data/alc.txt", header=TRUE, sep=",")
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The data presents the alcohol consumption of individuals, and their grades from three different periods, for persons with various other attributes. The grades are stored in the G1, G2 and G3 variables, and the alcohol consumption is stored in the Dalc, Walc, alc_use and high_use variables. Dalc and Walc separate alcohol consumption into weekdays and weekends, respectively, while alc_use is their mean value and high_use indicates if the alcohol use is excessive. The rest of the variables hold other information, for example sex, family size and many more.
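A sketch of how alc_use and high_use are presumably derived, mirroring the datacamp data-wrangling step (the threshold of 2 is an assumption here, not stated in this chapter):

```r
# alc_use is the mean of weekday (Dalc) and weekend (Walc) consumption on a 1-5
# scale; high_use flags averages above 2 (threshold assumed from the wrangling step)
Dalc <- c(1, 3, 5)
Walc <- c(2, 4, 5)
alc_use  <- (Dalc + Walc) / 2   # 1.5 3.5 5.0
high_use <- alc_use > 2         # FALSE TRUE TRUE
```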
The variables age, sex, romantic and famrel were chosen for this analysis, to gain insight into the social impact on an individual's alcohol consumption. Age quite obviously has an effect, due to the restriction of alcohol consumption for minors, while the sex, romantic and famrel variables can give insight into what kind of effect social life has on an individual's alcohol consumption. One might think, for example, that a single person would go out more in search of a partner, in the bar setting, or that a bad relationship with one's family might raise one's thirst. These of course do not have to be true, but we are here to investigate whether there is any statistical significance to these ideas.
The chosen variables age, famrel, romantic and sex, as well as the variables alc_use and high_use are presented as bar charts below.
library(tidyr); library(dplyr); library(ggplot2); library(GGally)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
alc_chosen <- select(alc, one_of(c("sex", "age", "romantic", "famrel", "alc_use", "high_use")))
gather(alc_chosen) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
Let’s do a similar summary plot as in the previous exercise:
ggpairs(alc_chosen, mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
Let's fit a logistic regression model to our data. We'll set the high_use variable as our target variable.
m <- glm(high_use ~ sex + age + romantic + famrel, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + age + romantic + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4907 -0.8563 -0.6465 1.1923 2.0815
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.0662 1.6976 -2.395 0.01661 *
## sexM 0.9319 0.2363 3.943 8.04e-05 ***
## age 0.2462 0.0994 2.477 0.01326 *
## romanticyes -0.2524 0.2570 -0.982 0.32606
## famrel -0.3330 0.1258 -2.648 0.00811 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 438.06 on 377 degrees of freedom
## AIC: 448.06
##
## Number of Fisher Scoring iterations: 4
Above we can see the summary of our model. From this information it would seem that the only variable not significant for determining high/low use is the romantic variable. Let's then calculate the coefficient odds ratios and their confidence intervals:
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR,CI)
##                     OR     2.5 %    97.5 %
## (Intercept) 0.01714253 0.0005878 0.4673610
## sexM        2.53936624 1.6059474 4.0625497
## age         1.27914516 1.0542090 1.5584770
## romanticyes 0.77696694 0.4653980 1.2774770
## famrel      0.71676417 0.5589500 0.9168390
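To make the odds-ratio column concrete: an odds ratio is simply the exponential of the corresponding logistic-regression coefficient. Using the sexM estimate from the model summary above:

```r
# A logistic-regression coefficient b maps to an odds ratio exp(b).
# The sexM estimate 0.9319 gives roughly 2.54: male students have about 2.5
# times the odds of high alcohol use, other variables held fixed.
exp(0.9319)  # ~2.54, matching the OR column above
```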
Due to lack of time and other deadlines, I had to cut my work short here for this week :(
date()
## [1] "Wed Nov 25 14:38:03 2020"
library(GGally)
library(ggplot2)
Let’s begin by loading the Boston data from the MASS package:
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
Let’s take a look at the dimensions of the data:
dim(Boston)
## [1] 506 14
So we can see that there are 14 variables, and 506 observations. Then let’s look at the structure of the data:
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The data shows information on housing in an area called Boston, Massachusetts, gathered by the U.S. Census Service (link). Each observation holds information on a Boston suburb or town with 14 variables. The variables include, for example, the per capita crime rate ("crim"), the average number of rooms per dwelling ("rm") and the pupil-teacher ratio ("ptratio"). For a full description of the data, please see for example this link.
Let's change the "chas" variable into a categorical one, since it takes values of 0 or 1 (just for plotting; we'll change it back to numeric in a bit).
Boston$chas = as.factor(Boston$chas)
Let’s plot all of the variables against each other and their correlations:
p1 <- ggpairs(Boston, mapping=aes(), diag=list(continuous="barDiag"), lower=list(continuous="smooth"))
suppressMessages(print(p1))
From the overwhelming graph above we can see many different plots, such as the distributions of each variable on the diagonal. The correlations are on the upper side of the plot, while the scatter plots between each variable are on the bottom side of the plot.
For example, when comparing the nitric oxide concentration with the proportion of units built before 1940, we can see that a higher proportion of old units goes together with a higher nitric oxide concentration. Another example is that the median value of owner-occupied homes increases with the average number of rooms per dwelling, which makes sense in terms of a more expensive house having more rooms in it.
Let’s print out the summaries of each variable:
summary(Boston)
## crim zn indus chas nox
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 0:471 Min. :0.3850
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1: 35 1st Qu.:0.4490
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.5380
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.5547
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.6240
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :0.8710
## rm age dis rad
## Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000
## 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000
## Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000
## Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549
## 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000
## tax ptratio black lstat
## Min. :187.0 Min. :12.60 Min. : 0.32 Min. : 1.73
## 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95
## Median :330.0 Median :19.05 Median :391.44 Median :11.36
## Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65
## 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95
## Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
From here we can for example see that high crime rates are rarer than low ones, considering that the max value is 89.0 while the median is about 0.26. I could not find information on exactly how many people the crime per capita refers to, only that it is the crime per capita per town. I guess it still has to be divided by 100 000 or something, otherwise it can't make sense.
Let's standardize the data by scaling it with the scale() function (also turning the "chas" variable back to numeric for the scaling), and print out the summary of the scaled data:
Boston$chas = as.numeric(Boston$chas)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
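One caveat worth flagging about the chas conversion above: as.numeric() on a factor returns the internal level codes, not the original values. Here this is harmless, since standardizing removes the constant offset, but it is an easy trap:

```r
# as.numeric() on a factor yields the level codes (1, 2), not the original 0/1
f <- as.factor(c(0, 1, 1))
as.numeric(f)                # 1 2 2 -- level codes
as.numeric(as.character(f))  # 0 1 1 -- recovers the original values
```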
From the summary of the scaled data we can see that all variables now have a mean of 0.0, meaning that we have centered all of them. Now let's transform the crime rate into a categorical variable, using the quantiles as break points (just as in the datacamp exercises), so that the crime rate is given as "low", "med_low", "med_high" and "high":
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels=c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
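A toy illustration of this quantile-based binning, on a small vector:

```r
# cut() with quantile() breaks splits the values into four equally populated classes
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
b <- quantile(x)  # 0%, 25%, 50%, 75%, 100% quantiles as break points
cut(x, breaks = b, include.lowest = TRUE,
    labels = c("low", "med_low", "med_high", "high"))
# low low med_low med_low med_high med_high high high
```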
Now let’s create test and training sets of the data, just like we did in the datacamp exercises. We’ll pick randomly 80% of the data for the training set, and the rest goes to the testing set:
set.seed(4313) #set seed to obtain same result in predictions (for writing purposes, comment out if you want different results)
n <- nrow(boston_scaled) #number of rows
ind <- sample(n, size = n * 0.8) #random indices of 80% of the datapoints
train <- boston_scaled[ind,] #create training data
test <- boston_scaled[-ind,] #create testing data
correct_classes <- test$crime #Store the correct classes from test data in its own variable
test <- dplyr::select(test,-crime) #Remove the correct classes from the test data
Now that we have our test and training sets of the scaled Boston data, let's fit a linear discriminant analysis on the training set. The categorical crime rate will serve as our target variable, while all other variables are predictors. After the fitting, let's plot the LDA with the help of the arrow function from the datacamp exercise:
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 5)
For some reason the x-axis seems to be flipped when comparing with the datacamp plot; the signs of the discriminant axes in LDA are arbitrary, though, so this does not change the interpretation.
We already removed the correct classes from the test data, and stored them in their own variable, in the previous section. Let’s now predict the classes with our LDA model, and cross tabulate our predictions with the correct classes:
lda.pred <- predict(lda.fit, newdata=test)
table(correct=correct_classes, predicted=lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 6 0 0
## med_low 7 15 6 0
## med_high 1 14 15 0
## high 0 0 0 19
We can see from the cross-tabulation that we have 34 wrongly predicted classes out of 102. In percentages, our model has a success rate of about 66.7%, which isn't too bad for such a small dataset, but could still be better. The results reported here are valid as long as the seed for the random train/test split hasn't changed.
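The accuracy figure can be checked directly from the cross-tabulation; a sketch using the counts from the table above:

```r
# Correct predictions sit on the diagonal of the confusion matrix
tab <- matrix(c(19,  6,  0,  0,
                 7, 15,  6,  0,
                 1, 14, 15,  0,
                 0,  0,  0, 19), nrow = 4, byrow = TRUE)
sum(diag(tab)) / sum(tab)  # 68/102, about 0.667
```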
Let’s reload and standardize the Boston dataset:
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
Now we'll calculate the Manhattan distance matrix for the dataset (note that the code below uses the unscaled data):
dist <- dist(Boston, method="manhattan")
summary(dist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.016 149.145 279.505 342.899 509.707 1198.265
Then let’s run the k-means algorithm on our dataset, starting with three cluster centers, and plot the clustering for some of the variable plots:
km <- kmeans(Boston, centers = 3)
pairs(Boston[6:10], col=km$cluster)
From this it is difficult to determine whether the number of clusters is optimal or not. Simply changing the number of clusters and replotting is probably not going to help either, so let's investigate the optimal number of clusters by plotting the total within-cluster sum of squares (WCSS) against the number of clusters. The optimal number of clusters is found where the WCSS drops radically.
twcss <- sapply(1:10, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x=1:10, y=twcss, geom='line')
From the plot we can see that the WCSS drops radically at x=2, so that is our optimal number of clusters. Let's rerun the k-means algorithm with this number of clusters and plot some of the results.
km <- kmeans(Boston, centers = 2)
pairs(Boston[6:10], col=km$cluster)
pairs(Boston[7:14], col=km$cluster)
The k-means algorithm seems to handle well at least most of the very clear cases where there are two clusters in the data. In some of the plots there are curious data points that seem to be clearly in the wrong cluster; see for example some of the "rad" plots, where a couple of black data points sit inside the red cluster.
That’s it for this week, sadly I have no more time for the bonus exercises. See you next week!
date()
## [1] "Wed Nov 25 14:38:20 2020"
library(GGally)
library(ggplot2)
library(FactoMineR)
library(dplyr)
library(tidyr)
Let’s load the human data, plot the variables and look at their summaries:
human <- read.csv("data/human.dat", header=TRUE, row.names=1, sep=",")
p1 <- ggpairs(human, mapping=aes(), lower=list(continuous="smooth"))
suppressMessages(print(p1))
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
From the diagonal in the matrix plot we can see the distributions of each variable. For example we can see that the life expectancy peaks at less than 80 years, and that the expected years of education peaks at less than 15. The scatter plots show the inter-variable dependencies, with a linear fit included. The linear fit seems descriptive for some of the plots, like for the expected years of education vs. life expectancy, but not for others, such as in the ratio of education between females and males vs. the ratio of labour force between females and males.
In the summary of the data we can see the detailed information on the variable means, quantiles, as well as their minima and maxima. We can for example see the mean life expectancy to be 71.65 years, and the mean percentage of female representation in parliament at 20.91 percent.
Let's run PCA on our data while it is not standardized:
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03 0.0022190154
## Edu.Exp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02 0.1431180282
## Life.Exp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02 0.9865644425
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05 -0.0001135863
## Mat.Mor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03 0.0266373214
## Ado.Birth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03 0.0188618600
## Parli.F -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
## PC6 PC7 PC8
## Edu2.FM 2.180183e-02 6.998623e-01 7.139410e-01
## Labo.FM 3.264423e-02 7.132267e-01 -7.001533e-01
## Edu.Exp 9.882477e-01 -3.826887e-02 7.776451e-03
## Life.Exp -1.453515e-01 5.380452e-03 2.281723e-03
## GNI -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor 1.695203e-03 1.355518e-04 8.371934e-04
## Ado.Birth 1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F -2.309896e-02 -2.642548e-03 2.680113e-03
We can see the variability of each principal component in the printout of the pca_human variable. There we can see that the first two principal components have the largest variability. Let's draw a biplot with the first two principal components as our x- and y-axes:
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
The non-standardized data is very difficult to read from this plot: all of the arrows seem to point parallel to the first principal component, as if it held all of the variance. The axes show the percentage of variance each principal component captures, and PC1 appears to capture 100%, which is misleading.
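This scale-domination effect can be demonstrated on synthetic data: when one column's variance dwarfs the rest (as GNI's does above), PC1 aligns with it almost entirely. The variable names here are illustrative:

```r
# With one variable on a vastly larger scale, unscaled PCA lets that variable
# swallow nearly all of the variance captured by PC1
set.seed(42)
d <- data.frame(a = rnorm(100), b = rnorm(100), big = rnorm(100, sd = 1e4))
p <- prcomp(d)
summary(p)$importance["Proportion of Variance", "PC1"]  # very close to 1
```

Standardizing the columns first, as is done next, removes this artifact.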
Now let's standardize the data, run the PCA again, and look at the biplot.
human_std <- scale(human)
pca_human <- prcomp(human_std)
pca_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
Now we can see a better division of the captured variance between the first two principal components. The first one captures 53.6% and the second 16.2%, while the rest is spread over the components not seen in this plot. Scaling the variables is clearly important for the PCA of this data. From the arrows we can see that there is a high positive correlation between the variables Edu.Exp, Edu2.FM, Life.Exp and GNI, a high positive correlation between the variables Parli.F and Labo.FM, and a high positive correlation between Mat.Mor and Ado.Birth. We can also see that PC1 captures most of its variance from Edu.Exp, Edu2.FM, Life.Exp, GNI, Mat.Mor and Ado.Birth, while PC2 captures most of its variance from Parli.F and Labo.FM.
It would seem that life expectancy, expected years of education, and the female/male education ratio are strongly correlated. This makes sense if countries with a higher level of education have a longer life expectancy, as well as a larger gross national income. Looking at the countries on the side these arrows point to, we can see countries classified as "developed", such as Norway, Australia and Japan. At the other end of the arrows we have countries such as Sudan, Sierra Leone and Afghanistan, classified as "third-world" countries, which have at some point been under occupation by western countries. Maternal mortality and adolescent birth rate are also strongly correlated with each other, and from the graphical overview of the data it would seem that a higher adolescent birth rate might go together with higher maternal mortality. These arrows show the reverse trend: they point towards the less developed countries, while the more developed countries lie at the other end. Female representation in parliament and the female/male labour ratio are correlated and point upwards, towards countries such as Norway, Cuba and Rwanda, while at the other end there are countries such as the United Arab Emirates and Yemen, where women have fewer rights.
Let’s load the tea data from the FactoMineR package and print its summary, structure and dimensions. Let’s also restrict ourselves to the same variables as in the datacamp exercises, and plot their bar plots:
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
dim(tea_time)
## [1] 300 6
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Then let’s run MCA on our data:
mca <- MCA(tea_time, graph= FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage="quali")
From the above plot we can see similarities between the category levels. E.g. the "unpackaged" and "tea shop" categories are close to each other, but not to the rest of the categories.
(more chapters to be added similarly as we proceed with the course!)